MSGSU
ISTATISTIK BOLUMU - R ILE ISTATISTIKSEL PROGRAMLAMA DERS NOTLARI
by ozge.ozdamar@msgsu.edu.tr is licensed under a
Creative
Commons Attribution-NonCommercial-ShareAlike 4.0 International
License.
Hata ve öneriler için emoji::("email") ***
LIBRARIES
.packages = c("dplyr", "hflights", "gapminder","kableExtra")
.inst <- .packages %in% installed.packages()
if(length(.packages[!.inst]) > 0) install.packages(.packages[!.inst])
lapply(.packages, require, character.only=TRUE)
REFERENCES
The dplyr package was developed by Hadley Wickham of
RStudio and is an optimized and distilled version of his
plyr package. The dplyr package does not
provide any “new” functionality to R per se, in the sense that
everything dplyr does could already be done with base R,
but it greatly simplifies existing functionality in R.
The dplyr package contains five key data manipulation functions, also called verbs:
| verb | |
|---|---|
select() |
returns a subset of the columns, |
filter() |
returns a subset of the rows, |
arrange() |
reorders the rows according to single or multiple variables, |
mutate() |
add columns from existing data, |
summarise() |
which reduces each group to a single row by calculating aggregate measures. |
Common dplyr Function Properties
All of the functions that we will discuss will have a characteristics.
data.frame
library(hflights)
data(hflights)
head(hflights)
str(hflights)
## 'data.frame': 227496 obs. of 21 variables:
## $ Year : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
## $ Month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ DayofMonth : int 1 2 3 4 5 6 7 8 9 10 ...
## $ DayOfWeek : int 6 7 1 2 3 4 5 6 7 1 ...
## $ DepTime : int 1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ...
## $ ArrTime : int 1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ...
## $ UniqueCarrier : chr "AA" "AA" "AA" "AA" ...
## $ FlightNum : int 428 428 428 428 428 428 428 428 428 428 ...
## $ TailNum : chr "N576AA" "N557AA" "N541AA" "N403AA" ...
## $ ActualElapsedTime: int 60 60 70 70 62 64 70 59 71 70 ...
## $ AirTime : int 40 45 48 39 44 45 43 40 41 45 ...
## $ ArrDelay : int -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
## $ DepDelay : int 0 1 -8 3 5 -1 -1 -5 43 43 ...
## $ Origin : chr "IAH" "IAH" "IAH" "IAH" ...
## $ Dest : chr "DFW" "DFW" "DFW" "DFW" ...
## $ Distance : int 224 224 224 224 224 224 224 224 224 224 ...
## $ TaxiIn : int 7 6 5 9 9 6 12 7 8 6 ...
## $ TaxiOut : int 13 9 17 22 9 13 15 12 22 19 ...
## $ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CancellationCode : chr "" "" "" "" ...
## $ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...
is.data.frame(hflights)
## [1] TRUE
is.tbl(hflights)
## [1] FALSE
select( )select() allows you to choose the columns of interest
out of your data frame.
select(df, var1, var2)
select(df, 1:4, -2)
select() does not change the data frame; you have to
assign the result to a variable to store.ex: select first 3 variables of hflights
names(hflights)
## [1] "Year" "Month" "DayofMonth"
## [4] "DayOfWeek" "DepTime" "ArrTime"
## [7] "UniqueCarrier" "FlightNum" "TailNum"
## [10] "ActualElapsedTime" "AirTime" "ArrDelay"
## [13] "DepDelay" "Origin" "Dest"
## [16] "Distance" "TaxiIn" "TaxiOut"
## [19] "Cancelled" "CancellationCode" "Diverted"
head(select(hflights,Year, Month, DayofMonth))
head(select(hflights, 1:3))
head(select(hflights,Year:DayofMonth))
ex:select Airtime and ArrDelay
head(select(hflights, AirTime, ArrDelay))
ex: view first 10 observations of AirTime, ArrDelay and DepDelay
head(select(hflights, AirTime, ArrDelay, DepDelay),10)
ex: view first 10 observations of the variables between Origin and Cancelled
head(select(hflights, Origin:Cancelled),10)
inspect:
head(select(hflights, - (DepTime:AirTime)),10)
dplyr comes with a set of helper functions that can help you select
groups of variables inside a select() call:
| function | |
|---|---|
starts_with("X") |
every name that starts with “X”, |
ends_with("X") |
every name that ends with “X”, |
contains("X") |
every name that contains “X”, |
matches("X") |
every name that matches “X”, where “X” can be a regular expression, |
num_range("x", 1:5) |
the variables named x01, x02, x03, x04 and x05, |
one_of(x) |
every name that appears in x, which should be a character vector. |
ex: view first 10 observations of ArrDelay and DepDelay
head(select(hflights, ends_with("Delay")),10)
ex: view first 10 observations of UniqueCarrier, FlightNum, TailNum, Cancelled, and CancellationCode
head(select(hflights, UniqueCarrier, ends_with("Num"), starts_with("Cancel")),10)
ex: view first 10 observations of DepTime, ArrTime, ActualElapsedTime, AirTime, ArrDelay and DepDelay
head(select(hflights, contains("Tim"), contains("Del")),10)
r base and dplyr difference
ex1r <- hflights[c("TaxiIn", "TaxiOut", "Distance")]
ex1d <- select(hflights, contains("Taxi"), Distance)
ex2r <- hflights[c("Year", "Month", "DayOfWeek", "DepTime", "ArrTime")]
ex2d <- select(hflights, Year:ArrTime, -DayofMonth)
ex3r <- hflights[c("TailNum", "TaxiIn", "TaxiOut")]
ex3d <- select(hflights, starts_with("T"))
filter( )select rows
filtered.rows <- filter(DATAFRAME, CONDITION-1, CONDITION-2, ..., CONDITION-N)
ex: 3000 ve daha fazla mil yapilan butun ucuslari listeleyin (Distance degiskeni)
filter(hflights, Distance >= 3000)
Q 15: JetBlue, Southwest ya da Delta tarafinda gerceklestirilen tum ucuslari listeleyin (UniqueCarrier degiskeni)
filter(hflights, UniqueCarrier %in% c("JetBlue", "Southwest", "Delta"))
Q 16: filter() kullanarak TaxiIn ve TaxiOut toplami ucus suresinden(AirTime) fazla olan ucuslari listeleyin
filter(hflights, TaxiIn + TaxiOut > AirTime)
boolean operators in filter():
& (and),| (or),! (not)Instead of the & operator, you can pass several
logical tests to filter(), separated by commas.
select rows
filter(df, a > 0 & b > 0)
filter(df, a > 0, b > 0)
is.na() keeps the observations in df for which the variable
x is not NA:
filter(df, !is.na(x))
Q 17: 5:00 am den once (DepTime 500) kalkan veya 10:00 pm den sonra(ArrTime 2200) gelen ucuslari listeleyin
filter(hflights, DepTime < 500 | ArrTime > 2200)
Q 18: Gec kalkan (DepDelay, delay >0) fakat beklenen sureden once inis yapan(ArrDelay >0) ucuslari listeleyin
filter(hflights, DepDelay > 0, ArrDelay < 0)
Q 19: Gec kalktigindan (DepDelay > 0) iptal edilen (Canceled 1 olmali) ucuslari listeleyin
filter(hflights, Cancelled == 1, DepDelay > 0)
Q 20: JFK havaalanina(Dest JFK olmali) inis yapilacak ucuslari c1 olarak kaydedin
c1 <- filter(hflights, Dest == "JFK")
Q 21: c1 e Date isimli yeni degisken ekleyin ve c2 olarak kaydedin. eklenen degisken Yil-Ay-G??n formatinda olsun (Year, Month, DayofMonth)
c2 <- mutate(c1, Date = paste(Year, Month, DayofMonth, sep = "-"))
Q 22: c2 veri setinden srasiyla Date, DepTime, ArrTime ve TailNum degiskenlerini ekrana yazdirin
select(c2, Date, DepTime, ArrTime, TailNum)
How many weekend flights flew a distance of more than 1000 miles but had a total taxiing time below 15 minutes?
Use the dataset in package gapminder for the
exercises
gapminder::gapminder
filter(gapminder, country=="Canada", year < 1970)
filter(gapminder,
country == "Canada" |
country == "Algeria")
filter(gapminder,
country %in% c("Canada", "Algeria"))
filter(gapminder,
country %in% c("Canada", "Algeria"), year < 1970, year >= 1960)
filter(gapminder,
(country == "Canada") |
(country == "Algeria" &
year %in% 1960:1969))
filter(gapminder,
continent != "Europe")
gapminder %>%
filter(country %in% c("Canada", "Algeria"),
year <= 1969, year >= 1960) %>%
select(country, year, gdpPercap)
Ctrl + Shift + M (Windows), Cmd + Shift + M (Mac).
The pipe operator takes the thing on the left-hand-side and pipes it into
the function call on the right-hand-side ??? literally, drops it in as the first argument.
head(gapminder)
gapminder %>% head()
select(gapminder, year, lifeExp)
gapminder %>%
select(year, lifeExp) %>%
head(4)
Q1: UniqueCarrier degiskeninde kodlanmis bulunan sirket isimlerinin tam hallerini Carrier degiskeni olarak olusturarak veri setine ekleyelim
lookup table When you subset the lookup table with a character string R will return the values of the lookup table that correspond to the names in the character string.
a <- c("AA", "AS")
b <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue")
a <- b[a]
a
## AA AS
## "American" "Alaska"
add a new variable Carrier, with lookup table UniqueCarrier
isim <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental",
"DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways",
"WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier",
"FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")
Add the Carrier column to hflights
hflights$Carrier <- isim[hflights$UniqueCarrier]
Glimpse at hflights
dplyr::glimpse(hflights)
## Rows: 227,496
## Columns: 22
## $ Year <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011…
## $ Month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ DayofMonth <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ DayOfWeek <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2…
## $ DepTime <int> 1400, 1401, 1352, 1403, 1405, 1359, 1359, 1355, 1443…
## $ ArrTime <int> 1500, 1501, 1502, 1513, 1507, 1503, 1509, 1454, 1554…
## $ UniqueCarrier <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA"…
## $ FlightNum <int> 428, 428, 428, 428, 428, 428, 428, 428, 428, 428, 42…
## $ TailNum <chr> "N576AA", "N557AA", "N541AA", "N403AA", "N492AA", "N…
## $ ActualElapsedTime <int> 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, 56, 63, …
## $ AirTime <int> 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, 41, 44, …
## $ ArrDelay <int> -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29, 5, -9, …
## $ DepDelay <int> 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, -2, -3, …
## $ Origin <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "IA…
## $ Dest <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "DF…
## $ Distance <int> 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 22…
## $ TaxiIn <int> 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6, 12, 8,…
## $ TaxiOut <int> 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11, 13, 15…
## $ Cancelled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ CancellationCode <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", …
## $ Diverted <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Carrier <chr> "American", "American", "American", "American", "Ame…
Q2: Benzer sekilde CancellationCode degiskeni, isim vektoruyle verilen degerler ile Code degiskeni olarak veri setine eklensin. (NA ornegi)
head(unique(hflights$CancellationCode),10) # unik degerleri gosterir
## [1] "" "A" "B" "C" "D"
isim <- c("A" = "carrier", "B" = "weather", "C" = "FFA", "D" = "security", "E" = "not cancelled")
hflights$Code <- isim[hflights$CancellationCode]
head(hflights)
mutate( )The mutate function allows you to create additional columns for your data frame opposite of select
more.columns <- mutate(DATAFRAME, new.column1 = old.column * 2, new.column2 = old.column * 3 )
mutate() even allows you to use a new variable while
creating a next variable in the same call. In this example, the new
variable x is directly reused to
create the new variable y: mutate(my_df, x = a + b, y = x + c)
Q9: ActualElapsedTime ile AirTime fark??n?? ActualGroundTime degiskeni olarak hflights a ekleyen g1 isimli data.frame olusturun
g1 <- mutate(hflights, ActualGroundTime = ActualElapsedTime - AirTime)
Q10: g1 e GroundTime degiskenini ekleyin (TaxiIn ile TaxiOut degiskenlerinin toplami) ve g2 olarak kaydedin. GroundTime ve ActualGroundTime degiskenlerinin ayni oldugunu gozlemleyin.
g2 <- mutate(g1, GroundTime = TaxiIn + TaxiOut)
head(select(g2, GroundTime, ActualGroundTime),10)
Q11: g2 ye AverageSpeed degiskenini (Distance / AirTime * 60) ekleyin vw g3 olarak kaydedin.
g3 <- mutate(g2, AverageSpeed = Distance / AirTime * 60)
Q12: m1 i hflights a loss (ArrDelay - DepDelay) ve loss_ratio (ratio of loss to DepDelay) degiskenlerini ekleyerek olusturun
m1 <- mutate(hflights, loss = ArrDelay - DepDelay, loss_ratio = loss / DepDelay)
Q13: m2 isimli tblyi 3 degisken ekleyerek olusturun: TotalTaxi: TaxiIn ve TaxiOut toplami ActualGroundTime: ActualElapsedTime ve AirTime farki Diff: yeni olusturulan 2 degiskenin farki Diff degiskeninin t??m degerlerinin 0 oldugunu gorun
m2 <- mutate(hflights, TotalTaxi = TaxiIn + TaxiOut,
ActualGroundTime = ActualElapsedTime - AirTime,
Diff = TotalTaxi - ActualGroundTime)
transmutate() drops the variable
head(g1)
head(transmute(g1,NewMonth = DayofMonth + 1),10)
Use the dataset in package gapminder for the
exercises
gapminder::gapminder
mutate(gapminder, gdp = gdpPercap * pop)
gapminder %>%
mutate(gdp = gdpPercap * pop)
mutate(gapminder,
gdp = gdpPercap * pop,
gdpBill = round(gdp/1000000000, 1))
gapminder %>%
mutate(gdp = gdpPercap * pop,
gdpBill = round(gdp/1000000000, 1))
mutate(gapminder, cc = paste(country, continent, sep=", "))
gapminder %>%
mutate(cc = paste(country, continent, sep=", "))
arrange()Arranging your data, as sorting rows. defaut asc
dtc
dtc <- filter(hflights, Cancelled == 1, !is.na(DepDelay))
Arrange dtc by departure delays
head(arrange(dtc, DepDelay),10)
head(arrange(dtc, CancellationCode),10)
head(arrange(dtc, UniqueCarrier, DepDelay),10)
head(arrange(hflights, UniqueCarrier, desc(DepDelay),10)
head(arrange(hflights, DepDelay + ArrDelay),10)
head(dplyr::arrange(gapminder, year, desc(lifeExp)),10)
ex
gapminder %>%
arrange(year, desc(lifeExp)) %>%
select(year, lifeExp, everything())
summarise()summarise() (summarize is also accepted) creates a
summary of a column,computing a single value from multipe values. use
group_by() splits the tibble into parts
#summarize(gapminder, mean_pop=mean(pop), sd_pop=sd(pop))
#gapminder %>%
# group_by(country) %>%
# head(summarize(mean_pop=mean(pop), sd_pop=sd(pop)))
group_by(gapminder, continent, country, year < 1970)
(out1 <- gapminder %>%
group_by(continent, country, year < 1970) %>%
summarize(mean_pop=mean(pop), sd_pop=sd(pop)))
out1 %>%
summarize(mean_pop=mean(mean_pop))
gapminder %>%
group_by(country) %>%
summarize(mingdp=min(gdpPercap))
gapminder %>%
group_by(country) %>%
summarize(n_distinct(year))
gapminder %>%
filter(continent=="Asia") %>%
group_by(year) %>%
summarize(minexp=min(lifeExp),
maxexp=max(lifeExp))
https://stat545.com/join-cheatsheet.html
https://dplyr.tidyverse.org/articles/two-table.html
https://r4ds.had.co.nz/relational-data.html
A key is a column or combination of columns, that occurs in each of the tables that you want to join.
when your key appears in your first table, it is called a primary key, because dplyr treats the first table as your primary table.
when the key appears in the second table, it is called secondary table, or foreign key.
this difference is important, because the key has a special job in the primary table: the primary key should uniquely identify each row in the first dataset
a<-data.frame(x1=c("A", "B","C"), x2= c(1,2,3))
b<-data.frame(x1=c("A","B","D"), x2=c("T", "F", "T"))
band_members
band_instruments
left_join(a, b): Return all rows from a,
and all columns from a and b. If there are
multiple matches between a and b, all
combination of the matches are returned. This is a mutating join.
left_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
left_join(a, b, by="x1")
left_join(a, b, by="x1", suffix=c("_1","_2"))
left_join(band_members, band_instruments, by="name")
right_join(a,b,by="x1")
inner_join(x, y): Return all rows from x
where there are matching values in y, and all columns from
x and y. If there are multiple matches between
x and y, all combination of the matches are
returned. This is a mutating join.
inner_join(a,b,by="x1")
full_join(a,b,by="x1")
semi_join(a,b,by="x1")
anti_join(a,b,by="x1")
How many flights flew to AUS in 2013?
hflights %>%
filter(Dest == "AUS",Year==2013)
How many flights flew to AUS in first week of January?
# Comma separated conditions are combined with '&'
hflights %>%
filter(Dest == "AUS", Month == 1, DayofMonth <= 7)
destination to AUS, ORD and TUL
hflights %>%
filter(Dest == "AUS" | Dest == "ORD" | Dest == "TUL")
hflights %>%
filter(is.element(Dest, c("AUS", "ORD", "TUL")))
Find longest delayed flights to AUS.
hflights %>%
filter(Dest == "AUS") %>%
arrange(desc(DepDelay))
table joining
import sets and themes datasets from legodataset
add theme information to sets table sets n= 11673 themes n = 614
library(readr)
sets<-read_csv("sets.csv")
## Rows: 11673 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): set_num, name
## dbl (3): year, theme_id, num_parts
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
themes<-read_csv("themes.csv")
## Rows: 614 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): name
## dbl (2): id, parent_id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
a<-sets %>%
inner_join(themes, by=c("theme_id" = "id"))
head(a)
a<-sets %>%
inner_join(themes, by=c("theme_id" = "id"), suffix = c("_set", "_theme"))
head(a)
a<-sets %>%
inner_join(themes, by=c("theme_id" = "id"), suffix = c("_set", "_theme")) %>%
count(name_theme, sort = TRUE)
load parts and part_categories tables
parts<-read_csv("parts.csv")
## Rows: 25993 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): part_num, name
## dbl (1): part_cat_id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
part_categories<-read_csv("part_categories.csv")
## Rows: 57 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): name
## dbl (1): id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
parts %>%
inner_join(part_categories, by = c("part_cat_id" = "id"))
parts %>%
inner_join(part_categories, by = c("part_cat_id" = "id"),suffix=c("_part", "_category"))
joining one to many relationships import inventories table
inventories<-read_csv("inventories.csv")
## Rows: 11681 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): set_num
## dbl (2): id, version
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sets %>%
inner_join(inventories, by="set_num")
sets %>%
inner_join(inventories, by="set_num")
inventories %>%
inner_join(sets, by="set_num")
sets %>%
inner_join(inventories, by="set_num") %>%
inner_join(themes, by = c( "theme_id" = "id"))
sets %>%
inner_join(inventories, by="set_num") %>%
inner_join(themes, by = c( "theme_id" = "id"), suffix = c( "_set", "_theme"))
OYKK2020-R
ile Veri Önişleme by ozge.ozdamar@msgsu.edu.tr is licensed under a
Creative
Commons Attribution-NonCommercial-ShareAlike 4.0 International
License.